Checking installation and loading packages

As usual, we first check that the required packages are installed and then load them.

# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')

library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)
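If you would like to see at a glance which packages are missing before installing anything, the check can be done for the whole list at once. This is only a sketch: the object names `required_packages`, `is_installed` and `to_install` are made up for illustration, and the install line is left commented out so nothing is installed by accident.

```r
# Packages this tutorial relies on
required_packages <- c("here", "tidyverse", "ggplot2")

# requireNamespace() returns TRUE if a package is available, without attaching it
is_installed <- vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed (may be empty)
to_install <- required_packages[!is_installed]
print(is_installed)

# Uncomment the next line to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```

Note that ggplot2 is one of the packages attached by library(tidyverse), so loading it separately is harmless but not strictly necessary.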

Testing between groups

This week, we are adding to our data analysis toolkit with a between groups analysis, using an independent samples t-test.

We observed last week how mood impacts active social media behaviour. However, mood is not the only factor that influences social media use. For example, Sapienza et al. (2023) found that people in rural areas are more likely to use their smartphone for social media and gaming, whereas urban dwellers are more likely to use their phone for navigation and business.

However, we do not know if people living in urban and rural areas engage with social media differently, regardless of how long they spend on their chosen platforms. Today we will address this question using the urban, good_mood_likes, bad_mood_likes, and followers variables.


Activity 1 - Formulate your research question

What do you think? Will urban and rural dwellers engage differently with social media? Will there be a difference in the number of likes made by people living in urban vs rural areas? Or in the number of followers people have in urban vs rural areas?

Question: What are the null hypotheses for your research questions (you should have one for ‘likes’ and one for ‘followers’)? What would you expect to see if your prediction is correct? Discuss this with your neighbour and your tutor.
Hint: We are going to average over the effect of mood, so we do not need to include mood in our predictions about likes.

Activity 2 - Creating our likes variable

Today we will be averaging across mood to get the number of likes in general for urban and rural dwellers. This means we first need to create a new variable called likes which is the average of the likes in a good and bad mood.

We first load in our PSYC2001_social-media-data.csv dataset.

social_media <- read.csv(file = here("Data","PSYC2001_social-media-data.csv")) #reads in CSV files

Are you able to fill in the code below using the mutate() function to create this new ‘likes’ variable?

social_media_likes <- social_media %>% 
  mutate(likes = (bad_mood_likes + good_mood_likes) / 2) %>% #create a new column with specified values
  select(id, urban, likes, followers) #keep specified columns in dataframe

head(social_media_likes)
##   id urban likes followers
## 1 S1     1 34.65     173.3
## 2 S2     1 47.15     144.3
## 3 S3     1 48.45      76.5
## 4 S4     1 29.55     171.7
## 5 S5     1 44.75     109.5
## 6 S6     1 23.55     157.5
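To convince yourself that the calculation inside mutate() really is a row-by-row average, it can help to try it on a tiny data frame where the answer is easy to work out by hand. The sketch below uses made-up numbers (not the tutorial dataset), and the object names `toy` and `toy_likes` are just for illustration.

```r
library(dplyr)

# Made-up data for illustration only (not the tutorial dataset)
toy <- data.frame(
  id = c("A", "B", "C"),
  good_mood_likes = c(40, 50, 30),
  bad_mood_likes = c(20, 30, 10)
)

toy_likes <- toy %>%
  mutate(
    likes = (bad_mood_likes + good_mood_likes) / 2,                 # average written out by hand
    likes_check = rowMeans(cbind(bad_mood_likes, good_mood_likes))  # same average using rowMeans()
  )

toy_likes$likes # 30 40 20
```

Both columns hold the same values, so either way of writing the average would work here.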

Wrangling our data

Now that we have this object, it is important to check the format of the data. Let's use the str() function that we learned about in the second tutorial to do this.

str(social_media_likes) #provides a summary of the data structure.
## 'data.frame':    60 obs. of  4 variables:
##  $ id       : chr  "S1" "S2" "S3" "S4" ...
##  $ urban    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ likes    : num  34.6 47.1 48.5 29.5 44.8 ...
##  $ followers: num  173.3 144.3 76.5 171.7 109.5 ...

First, we can see that the values in urban are coded as either 1 (urban) or 2 (rural). Let's change this so that we use the descriptive labels "urban" and "rural" instead of numbers. To do this we will use the mutate() function together with case_when(), which replaces (or creates) values in a variable according to the conditions we specify.

social_media_likes <- social_media_likes %>% 
  mutate(urban = case_when(urban == 1 ~ "urban", urban == 2 ~ "rural")) #case_when uses if_else logic to replace values with specified values if the cases match.

str(social_media_likes)
## 'data.frame':    60 obs. of  4 variables:
##  $ id       : chr  "S1" "S2" "S3" "S4" ...
##  $ urban    : chr  "urban" "urban" "urban" "urban" ...
##  $ likes    : num  34.6 47.1 48.5 29.5 44.8 ...
##  $ followers: num  173.3 144.3 76.5 171.7 109.5 ...
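One thing worth knowing about case_when() is what happens to values that match none of the conditions: they become NA. The small sketch below uses a made-up vector of codes (the value 3 is deliberately invalid) to show this.

```r
library(dplyr)

# Made-up codes for illustration; 3 is not a valid group code
codes <- c(1, 2, 1, 3)

labels <- case_when(
  codes == 1 ~ "urban",
  codes == 2 ~ "rural"
)

labels # "urban" "rural" "urban" NA

# Counting the labels (including any NA) is a quick way to spot unexpected codes
table(labels, useNA = "ifany")
```

In our dataset every value of urban is expected to be 1 or 2, so a check like table() should show no NA values after recoding.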

We can see that urban is now classed as chr (character), but we will eventually need to split our graphs by urban. This means that urban should be a factor instead. We can change this easily by using as.factor() within the mutate() function. The as.factor() function converts other data types to factors!

social_media_likes <- social_media_likes %>% 
  mutate(urban = as.factor(urban))

str(social_media_likes)
## 'data.frame':    60 obs. of  4 variables:
##  $ id       : chr  "S1" "S2" "S3" "S4" ...
##  $ urban    : Factor w/ 2 levels "rural","urban": 2 2 2 2 2 2 2 2 2 2 ...
##  $ likes    : num  34.6 47.1 48.5 29.5 44.8 ...
##  $ followers: num  173.3 144.3 76.5 171.7 109.5 ...
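Notice in the output above that the levels are listed as "rural" then "urban", even though the first participants are all urban. That is because as.factor() sorts the levels of a character vector alphabetically. The sketch below, using a made-up vector, shows the default ordering and how factor() with the levels argument lets you choose the order yourself.

```r
# Made-up group labels for illustration
groups <- c("urban", "rural", "urban")

# Default: levels are sorted alphabetically
f_default <- as.factor(groups)
levels(f_default)      # "rural" "urban"
as.integer(f_default)  # 2 1 2 (the underlying integer codes)

# Explicit ordering with factor(): "urban" becomes the first level
f_custom <- factor(groups, levels = c("urban", "rural"))
levels(f_custom)       # "urban" "rural"
```

The level order matters later on: it decides which group appears first in plots and which group is subtracted from which in the t.test() output.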

The data are now in a format that we can easily visualise and use in our statistical tests. Well done!

Figure 1: What it feels like teaching this section


Activity 3 - Visualising our data in general

We’re now going to look at the data in 2 ways. First, we’re going to look at how the data is distributed across all participants, so that we can check if the data meets our assumptions about normality. Second, we are going to plot our dependent variables (likes, followers) by group, to gain a visual understanding for what group differences might look like, if they exist.

Are you able to create a density plot for likes and another for followers? Note that we use a new argument here, linewidth, to control the thickness of the density line.

social_media_likes %>% 
  ggplot(aes(x = likes)) +
  geom_density(linewidth = 2, colour = "blue") + #the argument linewidth is used to alter the size of the density line. 
  labs(x = "Number of Likes", y = "Density") +
  theme_classic() 

social_media_likes %>% 
  ggplot(aes(x = followers)) + 
  geom_density(linewidth = 2, colour = "orange") +
  labs(x = "Number of followers", y = "Density") +
  theme_classic() 

Question: Do likes and followers look normally distributed to you? Why might the data be shaped how they are for each variable?
Question: What impact does your new understanding of the data have on your analysis, if any?
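If you want something to compare your density plots against, one option is to look at simulated data where you already know the answer. The sketch below is an optional aid rather than part of the required analysis: it simulates one roughly normal and one right-skewed sample (neither is the tutorial data), draws a Q-Q plot for each, and runs a Shapiro-Wilk test. The sample size of 60 simply mirrors our dataset, and the seed is arbitrary.

```r
set.seed(2001) # arbitrary seed so the simulated numbers are reproducible

# Simulated data for illustration only (not the tutorial dataset)
symmetric_sample <- rnorm(60, mean = 50, sd = 10) # drawn from a normal distribution
skewed_sample <- rexp(60, rate = 1 / 50)          # drawn from a right-skewed distribution

# Q-Q plots: points lying close to the line suggest approximate normality
qqnorm(symmetric_sample, main = "Simulated normal data")
qqline(symmetric_sample)
qqnorm(skewed_sample, main = "Simulated skewed data")
qqline(skewed_sample)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
p_symmetric <- shapiro.test(symmetric_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value
print(c(symmetric = p_symmetric, skewed = p_skewed))
```

The same two functions could be applied to social_media_likes$likes and social_media_likes$followers, but treat the results as one piece of evidence alongside the plots rather than a final verdict.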

Activity 4 - Visualising group differences

Now we are going to make some density plots and boxplots, split by the urban factor so that we can see the group differences.

Are you able to help with this?

Hint: Use the colour argument in aes() to split the plot by urban.

social_media_likes %>% 
  ggplot(aes(x = likes, colour = urban)) +
  geom_density(linewidth = 2) +
  labs(x = "Number of Likes", y = "Density") +
  scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
  theme_classic() 

social_media_likes %>% 
  ggplot(aes(x = followers, colour = urban)) +
  geom_density(linewidth = 2) +
  labs(x = "Number of Followers", y = "Density") +
  scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
  theme_classic() 

Info: We are using a new function here called scale_colour_manual(). This allows you to manually define the colours of specific parts of a graph. Here we have used it to define colours for specific groups.


social_media_likes %>% 
  ggplot(aes(y = likes, colour = urban, x = urban)) +
  geom_boxplot() +
  labs(x = " ", y = "Number of Likes") +
  scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
  theme_classic() 

social_media_likes %>% 
  ggplot(aes(y = followers, colour = urban, x = urban)) +
  geom_boxplot() +
  labs(x = "", y = "Number of Followers") +
  scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
  theme_classic() 
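Plots are easier to interpret alongside a few numbers. The sketch below shows one way to get the group size, mean, median and standard deviation for each group using group_by() and summarise(). It uses a small made-up data frame (`toy`) so the values are easy to check by hand; the same pipeline should work on social_media_likes.

```r
library(dplyr)

# Made-up data for illustration only (not the tutorial dataset)
toy <- data.frame(
  urban = factor(c("rural", "rural", "rural", "urban", "urban", "urban")),
  likes = c(50, 55, 60, 35, 40, 45)
)

toy_summary <- toy %>%
  group_by(urban) %>%          # do the following separately for each group
  summarise(
    n = n(),                   # number of participants in the group
    mean_likes = mean(likes),
    median_likes = median(likes),
    sd_likes = sd(likes)
  )

toy_summary
```

Comparing the mean with the median in each group is a quick numerical check on skew: if they differ a lot, the boxplot is probably showing a lopsided distribution or outliers.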

Question: What do you think the data suggest about the group differences for likes and followers? Was it in line with your predictions from Activity 1? Are there any caveats or reasons to be cautious about your interpretations?

Independent samples t-test

We now want to learn whether we have evidence for differences between urban and rural dwellers on the ‘likes’ and ‘followers’ variables.


Activity 5 - Undertaking an independent t-test

Can you work out how to perform an independent samples t-test for these variables?

Hint: Use the t.test() function from the tutorial last week and pay attention to the ‘paired’ argument!

t.test(formula = likes ~ urban, data = social_media_likes, var.equal = TRUE, paired = FALSE)
## 
##  Two Sample t-test
## 
## data:  likes by urban
## t = 3.2184, df = 58, p-value = 0.002112
## alternative hypothesis: true difference in means between group rural and group urban is not equal to 0
## 95 percent confidence interval:
##   4.273656 18.336344
## sample estimates:
## mean in group rural mean in group urban 
##            52.09333            40.78833
t.test(formula = followers ~ urban, data = social_media_likes, var.equal = TRUE, paired = FALSE)
## 
##  Two Sample t-test
## 
## data:  followers by urban
## t = -2.8182, df = 58, p-value = 0.006595
## alternative hypothesis: true difference in means between group rural and group urban is not equal to 0
## 95 percent confidence interval:
##  -65.40138 -11.07862
## sample estimates:
## mean in group rural mean in group urban 
##            105.6367            143.8767
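
To see what var.equal = TRUE is doing behind the scenes, the t statistic can be rebuilt by hand from the group means and a pooled variance. The sketch below does this on simulated data (not the tutorial dataset; the group names, means and seed are arbitrary choices) and checks the result against t.test().

```r
set.seed(42) # arbitrary seed for reproducibility

# Simulated scores for two independent groups (illustration only)
group_a <- rnorm(30, mean = 50, sd = 10)
group_b <- rnorm(30, mean = 42, sd = 10)

n_a <- length(group_a)
n_b <- length(group_b)

# Pooled variance: a weighted average of the two sample variances
pooled_var <- ((n_a - 1) * var(group_a) + (n_b - 1) * var(group_b)) / (n_a + n_b - 2)

# Standard error of the difference in means, then the t statistic and p-value
se_diff <- sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_manual <- (mean(group_a) - mean(group_b)) / se_diff
df_manual <- n_a + n_b - 2
p_manual <- 2 * pt(abs(t_manual), df = df_manual, lower.tail = FALSE)

# The same test via t.test(), passing the two groups as separate vectors
result <- t.test(group_a, group_b, var.equal = TRUE)

all.equal(t_manual, unname(result$statistic)) # TRUE
all.equal(p_manual, result$p.value)           # TRUE
```

Two practical notes. First, paired = FALSE is the default for t.test(), so leaving the argument out also gives an independent samples test. Depending on your R version, the formula interface may refuse the paired argument altogether; if you see an error mentioning 'paired', removing that argument should resolve it. Second, if you leave out var.equal = TRUE, R runs Welch's t-test instead, which does not assume equal variances and reports a non-integer df.
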
Question: Discuss the output of these independent samples t-tests. What do they tell you about the differences between urban and rural dwellers and how they use social media? Is it what you expected when you formulated your hypotheses?

Figure 2: Exams are hard


Extension - Visualising our confidence intervals

This section is an extension activity if you have already finished the required materials. Please check with your tutor that you have a good grasp of the material before moving onto this section.

When performing statistical tests in the real world, we sometimes want to visualise our confidence intervals to see how precise our inferential estimates are. Let's also include two measures of central tendency, the mean and the median, on these visualisations.

Let's first extract the confidence interval and mean difference from our t.test() output for the followers measure.

results <- t.test(formula = followers ~ urban, data = social_media_likes, var.equal = TRUE, paired = FALSE)

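# t.test() reports rural minus urban; flip the sign so everything is expressed as urban minus rural
confidence_interval <- -rev(results$conf.int) # negate and reverse so the bounds stay ordered lower, upper

CI_lower <- confidence_interval[1]
CI_upper <- confidence_interval[2]

mean_difference <- unname(results$estimate[2] - results$estimate[1]) # mean difference (urban minus rural)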
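
It can be reassuring to see where the confidence interval comes from. The sketch below rebuilds the 95% interval for followers using only numbers copied from the printed t.test() output earlier in this tutorial (the two group means, the t value and the df). Because those printed values are rounded, the result matches the reported interval only to a few decimal places. The object names are made up for illustration.

```r
# Values copied from the printed t.test() output for followers
mean_rural <- 105.6367
mean_urban <- 143.8767
t_value <- -2.8182
df_value <- 58

# t.test() reports the difference as rural minus urban
diff_rural_minus_urban <- mean_rural - mean_urban

# t = difference / standard error, so standard error = difference / t
se_diff <- diff_rural_minus_urban / t_value

# Margin of error: critical t value for a 95% interval times the standard error
margin <- qt(0.975, df_value) * se_diff

ci_rural_minus_urban <- diff_rural_minus_urban + c(-1, 1) * margin
round(ci_rural_minus_urban, 2) # close to the reported -65.40 and -11.08

# Expressing the same interval as urban minus rural: negate and swap the bounds
ci_urban_minus_rural <- rev(-ci_rural_minus_urban)
round(ci_urban_minus_rural, 2)
```

Flipping the direction of the comparison changes the sign of the difference and of both bounds, but the width of the interval, and therefore the precision of the estimate, stays the same.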

Now let's calculate the difference in medians, as this is not produced by the t.test() function.

median_difference <- social_media_likes %>% 
  group_by(urban) %>% 
  summarise(median_followers = median(followers)) %>%
  summarise(diff = diff(median_followers)) %>%
  pull(diff)
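
The diff() function used above subtracts each value from the one that follows it, so with two groups it returns the second minus the first. Since the summarised rows follow the factor levels (rural, then urban), the result is urban minus rural. A tiny sketch with made-up medians:

```r
# Made-up medians for illustration, in the same order as the factor levels
medians <- c(rural = 100, urban = 140)

# diff() returns second minus first: here, urban minus rural
unname(diff(medians))      # 40

# Reversing the order flips the sign
unname(diff(rev(medians))) # -40
```

It is worth checking that the mean difference and the median difference are calculated in the same direction before plotting them side by side.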

Now let's combine all of this into a tidy data frame.

plot_data <- data.frame(
  mean_difference = c(mean_difference),
  median_difference = c(median_difference),
  lower = c(CI_lower),
  upper = c(CI_upper)
)

Let's now plot this using ggplot.

# Plot
ggplot(data = plot_data, aes(x = "followers")) +
  geom_point(aes(y = mean_difference), size = 4, colour = "green") +
  geom_point(aes(y = median_difference), size = 4, colour = "red") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1, na.rm = TRUE) + #plots error bars. Takes two main arguments: the lower and upper boundaries of the error bar (ymin and ymax respectively)
  labs(
    x = NULL,
    y = "Difference in Followers",
    title = "Mean and Median Differences in Followers by Urban Group",
    caption = "Mean includes 95% CI; Median shown without CI"
  ) +
  theme_minimal()

Question: How do these plots relate to the caveats discussed earlier?

Well done! This computing tutorial is now over. Make sure to thank your tutor for another amazing class full of wonderful statistics and learning!

Figure 3: Everyone loves statistics